Agentic Multimodal Intelligence: An Autonomous LLM Framework for Text and Image Data Analytics

Authors: Yarlagadda Srinivas, Marpudi Veerendra, Sanka Harshini, Gottapu Leela Vijay Kumar, Mr. M. Chiranjeevei

DOI Link: https://doi.org/10.22214/ijraset.2026.81867

Abstract

This paper addresses the problem of helping students choose the most suitable career path in an ever-evolving academic and professional environment. Traditional approaches to career guidance rely heavily on static evaluations or general counseling, which may not take into account a student's academic background, aptitude, interests, and specific academic stream. To overcome this drawback, the authors integrate the MERN technology stack with hybrid machine learning and deep learning approaches to develop a smart career guidance system. Three hybrid approaches are examined: RF + CNN, KNN + Autoencoder, and XGBoost + BiLSTM. The system was trained on a dataset of 10,000 students. A preprocessing pipeline was also integrated into the system, relying on one-hot encoding, numerical normalization, and TF-IDF-based text [...]. According to the experimental results, RF + CNN and XGBoost + BiLSTM achieve the highest accuracy, 99.25%, while KNN + Autoencoder achieves 93.60%. The authors also integrate a web-based system that helps students choose their Recommended Career Path, Primary Course, Job Options, and Alternative Paths.

Introduction

1. Retail Predictive Analytics & AI Transformation

The retail industry has been transformed by digitalization and massive data generation from e-commerce and omnichannel systems. Traditional intuition-based decision-making is giving way to AI and machine learning–driven predictive analytics, which are used to forecast demand, personalize experiences, optimize inventory, detect fraud, and improve profitability. The COVID-19 pandemic accelerated this adoption. Despite these benefits, many retailers still struggle to implement AI effectively. The study focuses on how predictive modeling improves retail efficiency and customer satisfaction using techniques like clustering, recommendation systems, and demand forecasting across sectors such as e-commerce, fashion, and supermarkets.

2. Mushroom Identification Using Deep Learning

Mushroom misidentification is a serious health risk because edible and poisonous species often look very similar, especially in rural areas lacking expert guidance. Environmental conditions further complicate manual identification. The study proposes a deep learning–based system using labeled image datasets and CNN architectures (e.g., EfficientNet-B3, ResNet-50) to classify mushrooms as edible or poisonous. Techniques like Grad-CAM improve interpretability, and confidence calibration improves reliability. The goal is to reduce poisoning incidents through automated, accurate, and explainable AI-based classification.
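
Confidence calibration can be illustrated with temperature scaling, one common technique. The text above does not say which calibration method the study uses, so the choice of technique here is an assumption, as are the example logits, the temperature value, and the function name softmax_with_temperature. In practice the temperature is usually fitted on a held-out validation set; the sketch below only shows its effect on the predicted probabilities.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities, dividing by a temperature first.

    A temperature above 1 softens the distribution (lower confidence);
    a temperature of 1 gives the ordinary softmax.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical classifier scores for the classes (edible, poisonous).
logits = [4.0, 1.0]
raw = softmax_with_temperature(logits)
calibrated = softmax_with_temperature(logits, temperature=2.0)
print(round(raw[0], 3), round(calibrated[0], 3))  # prints 0.953 0.818
```

The predicted class does not change, only the reported confidence, which is why this kind of calibration can be applied after training without affecting accuracy.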

3. Agentic Multimodal Intelligence Framework (AI Analytics)

Organizations face challenges analyzing diverse multimodal data (text, images, structured data) using fragmented tools. Traditional analytics systems lack integration and require heavy manual effort. The proposed solution introduces an agent-based multimodal system combining LLMs and vision-language models (e.g., GPT-4 Vision, Gemini, LLaVA). It uses a Planner–Executor–Evaluator (PEE) architecture to automate reasoning, analysis, and validation. With tools like RAG, embeddings, and Chain-of-Thought reasoning, the system enables unified, real-time insights across multiple data types for applications like business intelligence and research.
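
The retrieval step behind retrieval-augmented generation (RAG) can be sketched as a nearest-neighbor search over embeddings. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors, the file names in INDEX, and the function names cosine_similarity and retrieve are all hypothetical, and a real system would obtain its embeddings from a text or vision-language encoder.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the names of the k items whose embeddings are closest to the query."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query, index[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical shared embedding space holding text, image, and tabular items.
INDEX = {
    "sales_report.txt": [0.9, 0.1, 0.0],
    "store_photo.png": [0.1, 0.9, 0.2],
    "inventory.csv": [0.7, 0.0, 0.6],
}

print(retrieve([1.0, 0.0, 0.1], INDEX, k=2))
```

The retrieved items would then be passed to the language model as context. Because all modalities share one embedding space in this sketch, a single query can surface text, image, and structured items together.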

4. Related Work on Multimodal AI Systems

Data analytics has evolved from traditional statistical methods to deep learning and multimodal AI. CNNs and transformers improved image and text understanding, while LLMs and VLMs enabled joint reasoning over multiple modalities. However, most existing systems remain task-specific, lack full autonomy, and require manual intervention. Research shows multimodal and hybrid models outperform single-modality systems, but real-world deployment remains limited. Recent work also explores agent-based AI systems for autonomous decision-making, but integrated, end-to-end multimodal platforms are still lacking.

5. Proposed Multimodal Agentic System (Methodology)

The proposed system is a unified AI framework that processes structured data, text, and images using LLMs and vision models. It applies preprocessing (cleaning, normalization, tokenization, image transformation) and converts all inputs into multimodal embeddings for unified reasoning. Instead of strict train-test dependency, it uses hybrid evaluation and benchmarking across models (e.g., GPT-4 Vision, Gemini, LLaVA). A Planner–Executor–Evaluator architecture manages task decomposition, execution, and validation, improving accuracy through iterative self-correction. The system supports real-time analytics, visualization, and deployment-ready insights.
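
The control flow of the Planner–Executor–Evaluator loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names plan, execute, evaluate, and run_pee, the semicolon-separated task format, and the retry limit are all hypothetical, and the model calls are replaced by deterministic stubs so that the loop runs on its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepResult:
    step: str
    output: str
    attempts: int


def plan(task: str) -> List[str]:
    """Hypothetical planner: split a task into ordered sub-steps.

    A real planner would prompt an LLM; here the task is split on ';'.
    """
    return [part.strip() for part in task.split(";") if part.strip()]


def execute(step: str, attempt: int) -> str:
    """Hypothetical executor stub standing in for an LLM, VLM, or tool call.

    The first attempt at any step mentioning 'image' returns an empty
    answer, simulating a failure that the evaluator has to catch.
    """
    if "image" in step and attempt == 1:
        return ""
    return f"result of '{step}'"


def evaluate(step: str, output: str) -> bool:
    """Hypothetical evaluator: accept any non-empty output."""
    return bool(output)


def run_pee(task: str, max_attempts: int = 3) -> List[StepResult]:
    """Plan the task, then run each step, retrying until the evaluator accepts."""
    results = []
    for step in plan(task):
        for attempt in range(1, max_attempts + 1):
            output = execute(step, attempt)
            if evaluate(step, output):
                results.append(StepResult(step, output, attempt))
                break
        else:
            # No attempt passed evaluation within the retry budget.
            raise RuntimeError(f"step failed after {max_attempts} attempts: {step}")
    return results


for result in run_pee("summarize text; describe image"):
    print(result.step, "->", result.output, f"(attempts: {result.attempts})")
```

The retry inside run_pee is where iterative self-correction happens in this sketch: a rejected output triggers another execution attempt rather than being passed downstream.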

Conclusion

This paper introduces an Agentic Multimodal Intelligence Framework that performs autonomous data analysis on a single platform using structured data, text, and visual information. The framework is designed to overcome the limitations of traditional analytic systems, which ordinarily process only one modality and require substantial human input. The paper's primary contribution is the implementation of a Planner–Executor–Evaluator (PEE) architecture that facilitates intelligent task planning, execution, and self-evaluation. Using state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs), the system can autonomously carry out complex reasoning tasks, produce insights, and validate results. Experimental data confirm that the framework outperforms any individual model. By integrating multimodal embeddings with agentic reasoning and benchmarking, and by using an evaluator to reduce errors and improve the quality of results, the system produces consistent and interpretable results, thereby improving the overall quality of decision-making.

The proposed solution already shows good potential, but future updates can further enhance its capabilities and broaden its usage:

Audio and Video Multi-Modality: Future updates could include audio and video processing, allowing analysis of spoken words, video streams, and other forms of multimedia content.

Real-Time Data Integration: Integrating with live data sources (APIs, IoT sensors, streaming) would allow real-time analysis and decision-making.

Cloud-Based Deployment: Deploying the framework as a scalable, cloud-based SaaS solution would let multiple users access it at the same time and would support enterprise-level applications.

Domain-Specific Fine-Tuning: Fine-tuning a model for a specific domain (e.g., health care, financial services, legal analytics) would improve accuracy and relevance in that field.

Multi-Lingual Support: Extending the service to additional languages would make it accessible to more people globally.


Copyright

Copyright © 2026 Yarlagadda Srinivas, Marpudi Veerendra, Sanka Harshini, Gottapu Leela Vijay Kumar, Mr. M. Chiranjeevei. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Paper Id : IJRASET81867

Publish Date : 2026-05-03

ISSN : 2321-9653

Publisher Name : IJRASET
